Dataset

This dataset is based on a data file retrieved from Kaggle (Massive Yahoo Finance Data). The collected data includes five years of daily share prices (low, high, open and close values), trade volumes and dividends for the top 500 companies. Because analyzing shares by a broader category such as market capitalization gives meaningful insight into performance, a derived category called Market-Cap and monthly percentage returns are added to this data set.
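The text does not state how the monthly percentage returns are derived. As a rough sketch only, one common definition compares the last and first close within each calendar month. The prices below are hypothetical, and the first-to-last close formula is an assumption, not the author's confirmed method.

```r
# Hypothetical daily closes for one company (not taken from the dataset),
# already sorted by date
daily <- data.frame(
  Date  = as.Date(c("2023-01-03", "2023-01-31", "2023-02-01", "2023-02-28")),
  Close = c(100, 110, 110, 99)
)
daily$Month <- format(daily$Date, "%Y-%m")

# Assumed definition: change from the first to the last close of each month
first_close <- tapply(daily$Close, daily$Month, function(x) x[1])
last_close  <- tapply(daily$Close, daily$Month, function(x) x[length(x)])
monthlyReturnsPct <- (last_close - first_close) / first_close * 100
monthlyReturnsPct
```

Other definitions, such as comparing month-end closes of consecutive months, give slightly different values.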

Goal of Analysis

The goal of this analysis is to examine the overall market trend, both as a whole and by broader categories: Market-Cap, dividends provided by companies, and volatility of daily share prices.

Monthly performance vs Market Cap

Companies are categorized by market capitalization derived from the latest available data.

Categorizing Shares based on Market Cap

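Market capitalization is calculated and segregated as per the logic below.

Market Capitalization = Close value of Company’s Share × Traded shares volume

Because traded volume stands in for the number of shares outstanding, this value is a proxy for market capitalization. Categories are defined as per the table below.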

options(digits=2)
sum_data <- data.frame(
  rangeValues = c("Above 10 BN","from 2 BN upto 10 BN","from 300 MM upto 2 BN",
                  "from 50 MM upto 300 MM","less than 50 MM")
)
colnames(sum_data) <- c( "Range of Market Capitalization")
rownames(sum_data) <- c("Large-Cap", "Mid-Cap", "Small-Cap", "Micro-Cap", "Others" )
sum_data
##           Range of Market Capitalization
## Large-Cap                    Above 10 BN
## Mid-Cap             from 2 BN upto 10 BN
## Small-Cap          from 300 MM upto 2 BN
## Micro-Cap         from 50 MM upto 300 MM
## Others                   less than 50 MM
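As a compact illustration of the table above, the same binning can be sketched with base R's `cut()`. The sample values are hypothetical, and the left-closed intervals (a value of exactly 10 BN counts as Large-Cap) are an assumption chosen to mirror the `>=` / `<` comparisons used in the analysis code.

```r
# Category boundaries in USD; with right = FALSE each interval is [lower, upper)
breaks <- c(0, 5e+07, 3e+08, 2e+09, 1e+10, Inf)
labels <- c("Others", "Micro-Cap", "Small-Cap", "Mid-Cap", "Large-Cap")

# Hypothetical values, one per category (not taken from the dataset)
marketCapVal <- c(1e+07, 1e+08, 1e+09, 5e+09, 2e+10)

category <- cut(marketCapVal, breaks = breaks, labels = labels, right = FALSE)
as.character(category)
```

Using a single vector of breakpoints keeps the thresholds in one place, which makes them easier to change than a chain of conditions.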
# Monthly performance returns for top 500 Companies based on current Market capitalization
latestAvailableDate <- max(stock_details_5_years_tbl$Date)

# Market cap category boundaries in USD: 50 MM, 300 MM, 2 BN, 10 BN
marketCapRanges <- c(0,5e+07, 3e+08, 2e+09, 1e+10, Inf)


#Categorize companies by Market Cap
marketCapCategories <- c("Large-Cap", "Mid-Cap", "Small-Cap", "Micro-Cap", "Others" )
companiesCategorizedByMarketCap <- stock_details_5_years_tbl |> 
  filter(Date == latestAvailableDate) |>
  mutate(MarketCapVal = Close * Volume) |>
  mutate(MarketCap = case_when( 1e+10 <= MarketCapVal & MarketCapVal < Inf ~ marketCapCategories[1],
                                2e+09 <= MarketCapVal & MarketCapVal < 1e+10 ~ marketCapCategories[2],
                                3e+08 <= MarketCapVal & MarketCapVal < 2e+09 ~ marketCapCategories[3],
                                5e+07 <= MarketCapVal & MarketCapVal < 3e+08 ~ marketCapCategories[4],
                                0 <= MarketCapVal & MarketCapVal < 5e+07 ~ marketCapCategories[5] )) |>
  arrange(desc(MarketCapVal)) |>
  select(Company, MarketCapVal, MarketCap )



# Monthly returns by MarketCap
monthly_returns_5_years_marketcap <-
  monthly_returns_5_years |> 
  left_join(companiesCategorizedByMarketCap, by = "Company")

fig <- list()
# Build one box plot of monthly returns per market cap category
for(i in seq_along(marketCapCategories)) {
  fig[[i]] <- plot_ly(monthly_returns_5_years_marketcap |> 
                        filter(MarketCap == marketCapCategories[i] ),
                       name = paste("MarketCap:", marketCapCategories[i] ),
          x=~Month, y=~monthlyReturnsPct, type = "box",
          marker = list(color = "black")) |>
    layout(title = paste("Performance of", marketCapCategories[i] , "Companies"),
           xaxis = list(title = "Month", tickvals = ~Month, ticktext = ~Month),
           yaxis = list(title = "Monthly Returns %"))
}

subplot(fig[[1]])
subplot(fig[[2]])
subplot(fig[[3]])
subplot(fig[[4]])
m <- list(
  l = 50,
  r = 50,
  b = 100,
  t = 100,
  pad = 50
)
#subplot(fig, nrows = 5) |> 
#  layout( title = "Monthly returns vs MarketCap", 
#          autosize = F, width = 800, height = 2000, margin = m)

Monthly performance box plots by market capitalization provide further insight into the performance of the stock market.

  • Box plots for Large-Cap and Mid-Cap companies indicate mostly positive monthly growth, apart from the market crashes around 2019 and 2022 and a few other negative months.
  • Box plots for Small-Cap and Micro-Cap companies indicate that as market capitalization becomes smaller, more outliers perform far better or far worse than the shares within the interquartile range (IQR). Traders who identify those outliers carefully may benefit from investing in them.

Dividends provided based on Market Cap

#Dividends
companiesWithDividends <- stock_details_5_years_tbl |> 
  subset(Dividends > 0) |> select(Company) |> distinct()
 
MarketCapDividends <-  companiesCategorizedByMarketCap |> 
    left_join(companiesWithDividends, by = "Company", keep = TRUE) |>   
    mutate(DividendProvided = if_else(is.na(Company.y), "WithDividend",
                                        "WithoutDividend")) |>
#    rename(MarketCap = MarketCap.x) |> 
    dplyr::select( MarketCap, DividendProvided) |>
  group_by(DividendProvided) |> table() |> as_tibble() |>
  pivot_wider(names_from = DividendProvided, values_from = n)
MarketCapDividends
## # A tibble: 5 × 3
##   MarketCap WithDividend WithoutDividend
##   <chr>            <int>           <int>
## 1 Large-Cap            1               0
## 2 Micro-Cap           52             209
## 3 Mid-Cap              3               3
## 4 Others              27             145
## 5 Small-Cap           23              28
fig <- plot_ly(MarketCapDividends, x=~MarketCap, y = ~WithDividend, 
        name = "With Dividend", type = "bar") |>
  add_trace(y = ~WithoutDividend, name = "Without Dividend") |>
  layout(barmode = "stack", title = "Total Companies",
         xaxis = list(categoryorder = "array", categoryarray = marketCapCategories,
                      title = "Market Cap"),
         yaxis = list(title = "Frequency"))
   
fig

From the plot above, companies with larger market capitalization appear to pay dividends to their shareholders more often than smaller ones.

Percentage of swing in daily share prices

options(digits=2)
# Percentage swing in daily share price within all shares for last five years

DailyPriceSwing <-
stock_details_5_years_tbl |> 
  mutate( VariationPercent = round((High - Low)*100/Close, 0)) |>
  dplyr::select(VariationPercent, Company) |>
  arrange(VariationPercent) 

  fig <- plot_ly(x = DailyPriceSwing$VariationPercent,  
                 type = "histogram") |>
    layout(xaxis = list(range = c(0,10)), yaxis = list(range = c(0,3e+05)))
fig
print(paste("Mean:", round(mean(DailyPriceSwing$VariationPercent), 2), "SD:", 
            round(sd(DailyPriceSwing$VariationPercent), 2)))
## [1] "Mean: 2.57 SD: 1.92"

Across all shares over the last five years, the daily percentage swing has a single-peaked distribution that is skewed to the right, so it is not strictly normal. Most swings within a day are 2% or less.
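The swing measure used in this section is (High - Low) × 100 / Close, rounded to a whole number. The following is a minimal sketch of that formula on hypothetical prices (not taken from the dataset). It also shows one simple sign of right skew: the mean sits above the median.

```r
# Hypothetical daily prices for illustration only
prices <- data.frame(
  High  = c(102, 51, 205, 33),
  Low   = c( 98, 50, 195, 30),
  Close = c(100, 50, 200, 30)
)

# Same formula as the analysis: intraday range as a percentage of the close
prices$VariationPercent <- round((prices$High - prices$Low) * 100 / prices$Close, 0)
prices$VariationPercent

# In a right-skewed sample the mean is pulled above the median
mean(prices$VariationPercent) > median(prices$VariationPercent)
```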

Central Limit Theorem

According to the central limit theorem, the sampling distribution of the mean is approximately normal when the sample size is large enough, provided the population has finite variance. The plots below show the distribution of the mean daily swing percentage for several sample sizes and are broadly consistent with the theorem.

options(digits=2)
 getMeansOfSamples <- function(x, sampleSizes, samples) {
    set.seed(4735)
    xbars <- matrix(rep(0, samples * length(sampleSizes)),nrow = length(sampleSizes), byrow = TRUE)
    for(i in 1:length(sampleSizes)){
      for(j in 1:samples) {
        xbars[i,j] <- mean(sample(x, sampleSizes[i], replace = FALSE))
      }
    }
    xbars
  }

  samples = 1000
  sampleSizes = c(10, 20, 30, 40)

  # Daily Share price swing percentage for random stock in five years 
  xbars <- getMeansOfSamples(DailyPriceSwing$VariationPercent, sampleSizes, samples)
  fig <- list()
  for(i in 1:length(sampleSizes)) {
    print(paste("Mean:", round(mean(xbars[i,]), 2), "SD:", round(sd(xbars[i,]), 2)))
    fig[[i]]<- plot_ly(x = ~xbars[i,], type = "histogram",
                       name = paste("Samplesize", 
                                    sampleSizes[i], sep = ":" )) |>
      layout(xaxis = list(range = c(1,5), title = "Daily Share price variation Percentage"), 
             yaxis = list(range = c(0,100), title = "Density")) 
    
  }
## [1] "Mean: 2.52 SD: 0.58"
## [1] "Mean: 2.57 SD: 0.45"
## [1] "Mean: 2.56 SD: 0.35"
## [1] "Mean: 2.55 SD: 0.3"
  subplot(fig) |> layout( title = "Daily Share price variation Percentage")

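The plot above shows that the distribution of sample means is still skewed to the right, similar to the original distribution, while the printed standard deviation shrinks from 0.58 to 0.3 as the sample size grows from 10 to 40.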
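For independent draws, the standard deviation of the sample mean is expected to be close to the population standard deviation divided by the square root of the sample size. The sketch below checks this against the figures printed above. It assumes the population SD of 1.92 reported earlier and ignores the finite population correction, which should be negligible for samples this small relative to the data.

```r
# Values reported in the printed output of this analysis
population_sd <- 1.92
sampleSizes   <- c(10, 20, 30, 40)
observed_sd   <- c(0.58, 0.45, 0.35, 0.30)

# Standard error of the mean: sigma / sqrt(n)
expected_sd <- population_sd / sqrt(sampleSizes)

data.frame(sampleSizes, expected_sd = round(expected_sd, 2), observed_sd)
```

The expected values (about 0.61, 0.43, 0.35 and 0.30) are close to the observed ones, which is consistent with the theorem.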

Sampling Methods

Three sampling methods are used to repeat the analysis above: simple random sampling without replacement, systematic sampling with unequal probabilities, and stratified sampling with equal-sized strata.

options(digits=2)
library(sampling)
# Various sampling methods
 
 #Simple random Sampling without replacement
  sampleSize <- 50
  populationSize <- length(DailyPriceSwing$VariationPercent)
  set.seed(4735)
 s <- srswor(sampleSize, populationSize) 
 rows <- rep(seq(1:populationSize), s)

 par(mfrow = c(2,2))
 
 plot_ly(y =~prop.table(table(DailyPriceSwing$VariationPercent[rows])), type = "bar",
         name = "Simple Random Sampling without replacement") |>
   layout(title = "Simple Random Sampling without replacement",
          xaxis = list(range = c(0,15), title = "Daily Price Swing %"), 
          yaxis = list(range = c(0,0.4), title = "Proportion")) 
 # Systematic sampling with unequal probabilities
 # Inclusion probabilities are proportional to each row's swing percentage
 pik <- inclusionprobabilities(DailyPriceSwing$VariationPercent, sampleSize)

s <- UPsystematic(pik)

systematicSamples <- DailyPriceSwing$VariationPercent[s!=0]


plot_ly(y =~prop.table(table(systematicSamples)), type = "bar",
        name = "Systematic Sampling un-equal Probabilities") |>
  layout(title = "Systematic Sampling un-equal Probabilities",
         xaxis = list(range = c(0,15), title = "Daily Price Swing %"), 
         yaxis = list(range = c(0,0.4), title = "Proportion")) 
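# Stratified, equal sized strata: three rows per company

# Count the strata from the data instead of hard-coding 500
nCompanies <- length(unique(DailyPriceSwing$Company))
sts <- sampling::strata(DailyPriceSwing, stratanames = c("Company"), size = rep(3, nCompanies),
       method = "srswor", description = FALSE)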

stSamples <- sampling::getdata(DailyPriceSwing, sts)


plot_ly(y =~prop.table(table(stSamples$VariationPercent)), type = "bar",
        name = "Stratified Sampling") |>
  layout(title = "Stratified Sampling",
         xaxis = list(range = c(0,15), title = "Daily Price Swing %"), 
         yaxis = list(range = c(0,0.4), title = "Proportion")) 

The plots above show that all three samples have a single peak and are skewed to the right, similar to the distribution of the original data.
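To make the difference between two of these methods concrete, here is a minimal base R sketch that does not use the `sampling` package. The toy population, the company names and the sample sizes are all hypothetical. Systematic sampling with unequal probabilities is left out because it is more involved.

```r
set.seed(4735)

# Hypothetical toy population: 4 companies with 50 daily swing values each
toy <- data.frame(
  Company = rep(c("A", "B", "C", "D"), each = 50),
  VariationPercent = round(rexp(200, rate = 1 / 2.5))
)

# Simple random sampling without replacement: every row has the same chance
srs_rows   <- sample(nrow(toy), size = 20, replace = FALSE)
srs_sample <- toy[srs_rows, ]

# Stratified sampling: draw the same number of rows from each company
per_stratum  <- 3
row_groups   <- split(seq_len(nrow(toy)), toy$Company)
strat_rows   <- unlist(lapply(row_groups, function(idx) sample(idx, per_stratum)))
strat_sample <- toy[strat_rows, ]

table(strat_sample$Company)
```

Simple random sampling may over- or under-represent a company by chance, whereas the stratified draw guarantees that every company contributes the same number of rows.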

Conclusions

From the analysis above of the last five years of share market data for the top 500 companies, the following observations are made.

  • Large-Cap and Mid-Cap companies have the fewest outliers in monthly share performance, and about half or more of them pay dividends to their shareholders.
  • As market capitalization becomes smaller, shareholders have to pay more attention to outliers, which can perform exceptionally well or badly. Most of these companies do not pay dividends to their shareholders.
  • Intraday swings in share prices are mostly 2% or less.
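
The outliers mentioned in these observations are the points a box plot draws beyond its whiskers, conventionally more than 1.5 × IQR away from the quartiles. The sketch below applies that rule to hypothetical monthly returns (not taken from the dataset). The quartile method is R's default and may differ slightly from the one a plotting library uses.

```r
# Hypothetical monthly returns in percent
monthlyReturnsPct <- c(-3, -1, 0, 1, 2, 2, 3, 4, 5, 40)

# Quartiles and interquartile range
q   <- unname(quantile(monthlyReturnsPct, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Conventional box plot fences at 1.5 * IQR beyond the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

outliers <- monthlyReturnsPct[monthlyReturnsPct < lower | monthlyReturnsPct > upper]
outliers
```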